Maximizing Component Quality in Bilingual Word-Aligned Segmentations

Authors

  • Spyros Martzoukos
  • Christof Monz
  • Christophe Costa Florêncio
Abstract

Given a pair of source and target language sentences that are translations of each other, with known word alignments between them, we extract bilingual phrase-level segmentations of such a pair. This is done by identifying two appropriate measures that assess the quality of phrase segments: one on the monolingual level, applied to both language sides, and one on the bilingual level. The monolingual measure is based on the notion of partition refinements, and the bilingual measure is based on structural properties of the graph that represents phrase segments and word alignments. These two measures are incorporated in a basic adaptation of the Cross-Entropy method for the purpose of extracting an N-best list of bilingual phrase-level segmentations. A straightforward application of such lists in Statistical Machine Translation (SMT) yields a conservative phrase pair extraction method that reduces phrase-table sizes by 90% with insignificant loss in translation quality.
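The abstract does not spell out the monolingual measure, so the following is only a minimal sketch of the underlying notion of partition refinement for contiguous phrase segmentations, not the authors' measure. The representation (a segmentation as a list of word lists) and the names `boundaries` and `refines` are assumptions made for illustration.

```python
from itertools import accumulate


def boundaries(segments):
    """Return the set of internal cut positions of a contiguous segmentation.

    Assumption: a segmentation is a list of segments, each a list of words.
    """
    lengths = [len(seg) for seg in segments]
    # Cumulative lengths of all but the last segment are the cut positions.
    return set(accumulate(lengths[:-1]))


def refines(fine, coarse):
    """True if every segment of `fine` lies inside one segment of `coarse`."""
    flat_fine = [w for seg in fine for w in seg]
    flat_coarse = [w for seg in coarse for w in seg]
    if flat_fine != flat_coarse:
        raise ValueError("segmentations cover different sentences")
    # `fine` refines `coarse` exactly when it keeps every cut of `coarse`.
    return boundaries(coarse) <= boundaries(fine)


# Hypothetical example sentence, segmented three ways.
coarse = [["the", "red", "car"], ["stopped"]]
fine = [["the"], ["red", "car"], ["stopped"]]
other = [["the", "red"], ["car", "stopped"]]
```

Here `refines(fine, coarse)` holds because `fine` only adds a cut inside the first segment of `coarse`, whereas `other` drops the cut before "stopped" and so is not a refinement of `coarse`.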

Similar resources

An Information Theoretic Approach to Bilingual Word Clustering

We present an information theoretic objective for bilingual word clustering that incorporates both monolingual distributional evidence as well as cross-lingual evidence from parallel corpora to learn high quality word clusters jointly in any number of languages. The monolingual component of our objective is the average mutual information of clusters of adjacent words in each language, while the...

Bilingual Distributed Word Representations from Document-Aligned Comparable Data

We propose a new model for learning bilingual word representations from non-parallel document-aligned data. Following the recent advances in word representation learning, our model learns dense real-valued word vectors, that is, bilingual word embeddings (BWEs). Unlike prior work on inducing BWEs which heavily relied on parallel sentence-aligned corpora and/or readily available translation reso...

Bilingually motivated segmentation and generation of word translations using relatively small translation data sets

Out-of-vocabulary (OOV) bilingual lexicon entries are still a problem for many applications, including translation. We propose a method for machine learning of bilingual stem and suffix translations that are then used in deciding segmentations for new translations. Various state-of-the-art measures used to segment words into their sub-constituents are adopted in this work as features to be used ...

Evaluating Compound-to-compound Links in a Sub-sentence Aligned Bilingual Corpus through Example-based Element Recognition

This paper will present an algorithm that evaluates links between one-word compounds and two-word compounds in a bilingual corpus that has been aligned at the sub-sentence level. The phenomenon of linking one-word compounds to multi-word compounds is common when English is being linked to other Germanic languages, and it is difficult to get the links right in the alignment process. The algorith...

Bilingual Lexicon Generation Using Non-Aligned Signatures

Bilingual lexicons are fundamental resources. Modern automated lexicon generation methods usually require parallel corpora, which are not available for most language pairs. Lexicons can be generated using non-parallel corpora or a pivot language, but such lexicons are noisy. We present an algorithm for generating a high quality lexicon from a noisy one, which only requires an independent corpus...

Publication date: 2014